This is an exploratory data analysis of the red wine data set. This tidy data set contains 1,599 red wines with 11 variables describing the chemical properties of each wine. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (very excellent). The data can be downloaded here. This text file explaining the data is also useful.
## Observations: 1,599
## Variables: 13
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, ...
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660,...
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06,...
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2...
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075,...
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15...
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, ...
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0...
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30,...
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46,...
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, ...
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5,...
## variable mean std_dev variation_coef p_01 p_05
## 1 X 800.000 4.6e+02 0.5772 16.980 80.900
## 2 fixed.acidity 8.320 1.7e+00 0.2093 5.200 6.100
## 3 volatile.acidity 0.528 1.8e-01 0.3392 0.190 0.270
## 4 citric.acid 0.271 1.9e-01 0.7189 0.000 0.000
## 5 residual.sugar 2.539 1.4e+00 0.5554 1.400 1.590
## 6 chlorides 0.087 4.7e-02 0.5381 0.043 0.054
## 7 free.sulfur.dioxide 15.875 1.0e+01 0.6589 3.000 4.000
## 8 total.sulfur.dioxide 46.468 3.3e+01 0.7079 8.000 11.000
## 9 density 0.997 1.9e-03 0.0019 0.992 0.994
## 10 pH 3.311 1.5e-01 0.0466 2.930 3.060
## 11 sulphates 0.658 1.7e-01 0.2576 0.420 0.470
## 12 alcohol 10.423 1.1e+00 0.1022 9.000 9.200
## 13 quality 5.636 8.1e-01 0.1433 4.000 5.000
## p_25 p_50 p_75 p_95 p_99 skewness kurtosis iqr
## 1 400.50 800.000 1199.50 1519.10 1583.02 0.000 1.8 8.0e+02
## 2 7.10 7.900 9.20 11.80 13.30 0.982 4.1 2.1e+00
## 3 0.39 0.520 0.64 0.84 1.02 0.671 4.2 2.5e-01
## 4 0.09 0.260 0.42 0.60 0.70 0.318 2.2 3.3e-01
## 5 1.90 2.200 2.60 5.10 8.31 4.536 31.5 7.0e-01
## 6 0.07 0.079 0.09 0.13 0.36 5.675 44.6 2.0e-02
## 7 7.00 14.000 21.00 35.00 50.02 1.249 5.0 1.4e+01
## 8 22.00 38.000 62.00 112.10 145.00 1.514 6.8 4.0e+01
## 9 1.00 0.997 1.00 1.00 1.00 0.071 3.9 2.2e-03
## 10 3.21 3.310 3.40 3.57 3.70 0.194 3.8 1.9e-01
## 11 0.55 0.620 0.73 0.93 1.26 2.426 14.7 1.8e-01
## 12 9.50 10.200 11.10 12.50 13.40 0.860 3.2 1.6e+00
## 13 5.00 6.000 6.00 7.00 8.00 0.218 3.3 1.0e+00
## range_98 range_80
## 1 [16.98, 1583.02] [160.8, 1439.2]
## 2 [5.2, 13.3] [6.5, 10.7]
## 3 [0.19, 1.02] [0.31, 0.74]
## 4 [0, 0.7] [0.01, 0.52]
## 5 [1.4, 8.31] [1.7, 3.6]
## 6 [0.04, 0.36] [0.06, 0.11]
## 7 [3, 50.02] [5, 31]
## 8 [8, 145] [14, 93.2]
## 9 [0.99, 1] [0.99, 1]
## 10 [2.93, 3.7] [3.12, 3.51]
## 11 [0.42, 1.26] [0.5, 0.85]
## 12 [9, 13.4] [9.3, 12]
## 13 [4, 8] [5, 7]
| variable | mean | std_dev | variation_coef | p_01 | p_05 | p_25 | p_50 | p_75 | p_95 | p_99 | skewness | kurtosis | iqr | range_98 | range_80 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X | 800.00 | 461.74 | 0.58 | 16.98 | 80.90 | 400.50 | 800.00 | 1199.50 | 1519.10 | 1583.02 | 0.00 | 1.8 | 799.00 | [16.98, 1583.02] | [160.8, 1439.2] |
| fixed.acidity | 8.32 | 1.74 | 0.21 | 5.20 | 6.10 | 7.10 | 7.90 | 9.20 | 11.80 | 13.30 | 0.98 | 4.1 | 2.10 | [5.2, 13.3] | [6.5, 10.7] |
| volatile.acidity | 0.53 | 0.18 | 0.34 | 0.19 | 0.27 | 0.39 | 0.52 | 0.64 | 0.84 | 1.02 | 0.67 | 4.2 | 0.25 | [0.19, 1.02] | [0.31, 0.74] |
| citric.acid | 0.27 | 0.19 | 0.72 | 0.00 | 0.00 | 0.09 | 0.26 | 0.42 | 0.60 | 0.70 | 0.32 | 2.2 | 0.33 | [0, 0.7] | [0.01, 0.52] |
| residual.sugar | 2.54 | 1.41 | 0.56 | 1.40 | 1.59 | 1.90 | 2.20 | 2.60 | 5.10 | 8.31 | 4.54 | 31.5 | 0.70 | [1.4, 8.31] | [1.7, 3.6] |
| chlorides | 0.09 | 0.05 | 0.54 | 0.04 | 0.05 | 0.07 | 0.08 | 0.09 | 0.13 | 0.36 | 5.68 | 44.6 | 0.02 | [0.04, 0.36] | [0.06, 0.11] |
| free.sulfur.dioxide | 15.87 | 10.46 | 0.66 | 3.00 | 4.00 | 7.00 | 14.00 | 21.00 | 35.00 | 50.02 | 1.25 | 5.0 | 14.00 | [3, 50.02] | [5, 31] |
| total.sulfur.dioxide | 46.47 | 32.90 | 0.71 | 8.00 | 11.00 | 22.00 | 38.00 | 62.00 | 112.10 | 145.00 | 1.51 | 6.8 | 40.00 | [8, 145] | [14, 93.2] |
| density | 1.00 | 0.00 | 0.00 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.07 | 3.9 | 0.00 | [0.99, 1] | [0.99, 1] |
| pH | 3.31 | 0.15 | 0.05 | 2.93 | 3.06 | 3.21 | 3.31 | 3.40 | 3.57 | 3.70 | 0.19 | 3.8 | 0.19 | [2.93, 3.7] | [3.12, 3.51] |
| sulphates | 0.66 | 0.17 | 0.26 | 0.42 | 0.47 | 0.55 | 0.62 | 0.73 | 0.93 | 1.26 | 2.43 | 14.7 | 0.18 | [0.42, 1.26] | [0.5, 0.85] |
| alcohol | 10.42 | 1.07 | 0.10 | 9.00 | 9.20 | 9.50 | 10.20 | 11.10 | 12.50 | 13.40 | 0.86 | 3.2 | 1.60 | [9, 13.4] | [9.3, 12] |
| quality | 5.64 | 0.81 | 0.14 | 4.00 | 5.00 | 5.00 | 6.00 | 6.00 | 7.00 | 8.00 | 0.22 | 3.3 | 1.00 | [4, 8] | [5, 7] |
From the output above, we know there are 1,599 observations (rows) and 13 variables (columns): the row index X, the 11 chemical properties, and quality. The summary also gives descriptive statistics for each variable if needed. In addition, the plot shows that the majority of the wines tested have a low residual sugar content and low chlorides.
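To make the summary columns less opaque, here is a minimal base R sketch that recomputes a few of them for the first eight fixed.acidity values shown in the data preview above. The helper name `profile_one` is hypothetical, and the moment-based skewness formula is an assumption: the tool that produced the table may use a different estimator. Because only eight values are used, the numbers will not match the full-data table.

```r
# First eight fixed.acidity values, copied from the data preview above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3)

# Hypothetical helper: a handful of the summary columns for one numeric vector.
profile_one <- function(x) {
  m <- mean(x)
  s <- sd(x)
  q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
  c(
    mean           = m,
    std_dev        = s,
    variation_coef = s / m,        # standard deviation relative to the mean
    p_25           = q[1],
    p_50           = q[2],
    p_75           = q[3],
    iqr            = q[3] - q[1],  # interquartile range
    # Assumed estimator: third central moment over variance^1.5.
    skewness       = mean((x - m)^3) / mean((x - m)^2)^1.5
  )
}

stats <- profile_one(fixed_acidity)
print(round(stats, 3))
```

A positive skewness here reflects the single high value (11.2) pulling the right tail, which is the same pattern the full table reports for residual sugar and chlorides on a much larger scale.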
## How Many Wines in Each Rating Group?
| fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7.9 | 0.35 | 0.46 | 3.6 | 0.078 | 15 | 37 | 0.9973 | 3.35 | 0.86 | 12.8 | 8 |
| 10.3 | 0.32 | 0.45 | 6.4 | 0.073 | 5 | 13 | 0.9976 | 3.23 | 0.82 | 12.6 | 8 |
| 5.6 | 0.85 | 0.05 | 1.4 | 0.045 | 12 | 88 | 0.9924 | 3.56 | 0.82 | 12.9 | 8 |
| 12.6 | 0.31 | 0.72 | 2.2 | 0.072 | 6 | 29 | 0.9987 | 2.88 | 0.82 | 9.8 | 8 |
| 11.3 | 0.62 | 0.67 | 5.2 | 0.086 | 6 | 19 | 0.9988 | 3.22 | 0.69 | 13.4 | 8 |
| 9.4 | 0.3 | 0.56 | 2.8 | 0.08 | 6 | 17 | 0.9964 | 3.15 | 0.92 | 11.7 | 8 |
| 10.7 | 0.35 | 0.53 | 2.6 | 0.07 | 5 | 16 | 0.9972 | 3.15 | 0.65 | 11 | 8 |
| 10.7 | 0.35 | 0.53 | 2.6 | 0.07 | 5 | 16 | 0.9972 | 3.15 | 0.65 | 11 | 8 |
| 5 | 0.42 | 0.24 | 2 | 0.06 | 19 | 50 | 0.9917 | 3.72 | 0.74 | 14 | 8 |
| 7.8 | 0.57 | 0.09 | 2.3 | 0.065 | 34 | 45 | 0.99417 | 3.46 | 0.74 | 12.7 | 8 |
| 9.1 | 0.4 | 0.5 | 1.8 | 0.071 | 7 | 16 | 0.99462 | 3.21 | 0.69 | 12.5 | 8 |
| 10 | 0.26 | 0.54 | 1.9 | 0.083 | 42 | 74 | 0.99451 | 2.98 | 0.63 | 11.8 | 8 |
| 7.9 | 0.54 | 0.34 | 2.5 | 0.076 | 8 | 17 | 0.99235 | 3.2 | 0.72 | 13.1 | 8 |
| 8.6 | 0.42 | 0.39 | 1.8 | 0.068 | 6 | 12 | 0.99516 | 3.35 | 0.69 | 11.7 | 8 |
| 5.5 | 0.49 | 0.03 | 1.8 | 0.044 | 28 | 87 | 0.9908 | 3.5 | 0.82 | 14 | 8 |
| 7.2 | 0.33 | 0.33 | 1.7 | 0.061 | 3 | 13 | 0.996 | 3.23 | 1.1 | 10 | 8 |
| 7.2 | 0.38 | 0.31 | 2 | 0.056 | 15 | 29 | 0.99472 | 3.23 | 0.76 | 11.3 | 8 |
| 7.4 | 0.36 | 0.3 | 1.8 | 0.074 | 17 | 24 | 0.99419 | 3.24 | 0.7 | 11.4 | 8 |
| X | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 460 | 11.6 | 0.58 | 0.66 | 2.2 | 0.074 | 10 | 47 | 1.0008 | 3.25 | 0.57 | 9 | 3 |
| 518 | 10.4 | 0.61 | 0.49 | 2.1 | 0.2 | 5 | 16 | 0.9994 | 3.16 | 0.63 | 8.4 | 3 |
| 691 | 7.4 | 1.185 | 0 | 4.25 | 0.097 | 5 | 14 | 0.9966 | 3.63 | 0.54 | 10.7 | 3 |
| 833 | 10.4 | 0.44 | 0.42 | 1.5 | 0.145 | 34 | 48 | 0.99832 | 3.38 | 0.86 | 9.9 | 3 |
| 900 | 8.3 | 1.02 | 0.02 | 3.4 | 0.084 | 6 | 11 | 0.99892 | 3.48 | 0.49 | 11 | 3 |
| 1300 | 7.6 | 1.58 | 0 | 2.1 | 0.137 | 5 | 9 | 0.99476 | 3.5 | 0.4 | 10.9 | 3 |
| 1375 | 6.8 | 0.815 | 0 | 1.2 | 0.267 | 16 | 29 | 0.99471 | 3.32 | 0.51 | 9.8 | 3 |
| 1470 | 7.3 | 0.98 | 0.05 | 2.1 | 0.061 | 20 | 49 | 0.99705 | 3.31 | 0.55 | 9.7 | 3 |
| 1479 | 7.1 | 0.875 | 0.05 | 5.7 | 0.082 | 3 | 14 | 0.99808 | 3.4 | 0.52 | 10.2 | 3 |
| 1506 | 6.7 | 0.76 | 0.02 | 1.8 | 0.078 | 6 | 12 | 0.996 | 3.55 | 0.63 | 9.95 | 3 |
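The two tables above list the wines rated 8 and the wines rated 3. As a minimal sketch of how such groups can be counted and extracted in base R, the block below uses only the alcohol and quality columns copied from those tables. The object names are illustrative, not the original analysis code, and the full data set contains other ratings as well.

```r
# Alcohol and quality copied from the two tables above:
# 18 wines rated 8, followed by 10 wines rated 3.
wine_sample <- data.frame(
  alcohol = c(12.8, 12.6, 12.9, 9.8, 13.4, 11.7, 11, 11, 14, 12.7,
              12.5, 11.8, 13.1, 11.7, 14, 10, 11.3, 11.4,
              9, 8.4, 10.7, 9.9, 11, 10.9, 9.8, 9.7, 10.2, 9.95),
  quality = c(rep(8L, 18), rep(3L, 10))
)

# Count the wines in each rating group.
rating_counts <- table(wine_sample$quality)
print(rating_counts)

# Extract a single rating group, as in the tables above.
top_rated <- subset(wine_sample, quality == 8)
print(nrow(top_rated))
```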
From the correlation matrix below, we can see that the two variables with the highest correlation to quality are alcohol level and volatile acidity.
To answer this question and investigate further, I am going to select the pairs of variables with a correlation above 0.60 or below -0.60.
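The selection step described here can be sketched in base R as follows. This is an illustration under stated assumptions, not the original code: it uses four columns for the ten wines rated 3 (copied from the table above), so the coefficients come from a small, single-rating sample and will differ from the full-data correlation matrix. The names `wine_sample`, `threshold`, and `strong_pairs` are made up for the example.

```r
# Four columns for the ten wines rated 3, copied from the table above.
wine_sample <- data.frame(
  fixed.acidity    = c(11.6, 10.4, 7.4, 10.4, 8.3, 7.6, 6.8, 7.3, 7.1, 6.7),
  volatile.acidity = c(0.58, 0.61, 1.185, 0.44, 1.02, 1.58, 0.815, 0.98, 0.875, 0.76),
  citric.acid      = c(0.66, 0.49, 0, 0.42, 0.02, 0, 0, 0.05, 0.05, 0.02),
  pH               = c(3.25, 3.16, 3.63, 3.38, 3.48, 3.5, 3.32, 3.31, 3.4, 3.55)
)

# Pearson correlation matrix of all columns.
cor_mat <- cor(wine_sample)

# Keep each pair once (upper triangle) when |r| exceeds the threshold.
threshold <- 0.6
strong <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[strong[, "row"]],
  var2 = colnames(cor_mat)[strong[, "col"]],
  r    = round(cor_mat[strong], 2)
)
print(strong_pairs)
```

Restricting the search to the upper triangle avoids listing each pair twice and skips the diagonal, where every variable trivially correlates with itself.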
The plot below shows that the higher the citric acid, the higher the fixed acidity.
In the correlation section above, we drew a box plot for all variables in the data frame to show correlation. Below is the same chart for the correlation between Volatile Acidity and the wine rating.
We can summarize from the chart below that, for wines rated 8, the volatile acidity range is smaller (judged by the size of the box and whiskers in the plot), and the majority of these wines have a volatile acidity between 0.33 and 0.49.
On the other hand, the wines rated 3 have a larger range of volatile acidity, with the majority falling between 0.61 and 1.02, which leads me to believe that lower volatile acidity leads to a higher rating.
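The box edges read off the chart can also be computed directly. The sketch below assumes the boxes follow Tukey's five-number summary (`fivenum` in base R), which is what base R box plots use; the original chart may have been drawn with a different tool. The volatile acidity values are copied from the rating 8 and rating 3 tables above, and the object names are illustrative.

```r
# Volatile acidity copied from the tables above:
# 18 wines rated 8, followed by 10 wines rated 3.
vol_acid <- data.frame(
  volatile.acidity = c(0.35, 0.32, 0.85, 0.31, 0.62, 0.3, 0.35, 0.35, 0.42, 0.57,
                       0.4, 0.26, 0.54, 0.42, 0.49, 0.33, 0.38, 0.36,
                       0.58, 0.61, 1.185, 0.44, 1.02, 1.58, 0.815, 0.98, 0.875, 0.76),
  quality = c(rep(8L, 18), rep(3L, 10))
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per rating.
by_rating <- split(vol_acid$volatile.acidity, vol_acid$quality)
five <- sapply(by_rating, fivenum)
rownames(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)

# Width of each box: distance between the hinges.
box_width <- five["upper_hinge", ] - five["lower_hinge", ]
print(box_width)
```

In this sample the box for rating 8 is both narrower and lower than the box for rating 3, which is the pattern described in the text.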
The final plot we are going to look at from the correlation section is the alcohol-to-rating plot. This plot shows a clear relationship between higher alcohol percentage and higher rating.
* The majority of the wines rated 8 have an alcohol level ranging from 11.30% to 12.90%
* The majority of the wines rated 3 have an alcohol level ranging from 9.70% to 10.70%
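A comparable chart can be drawn with base R's `boxplot`, which also returns the statistics it plots. This is a sketch rather than the original plotting code: it uses only the highest-rated and lowest-rated wines listed in the tables earlier in this report, and the axis labels and object names are assumptions.

```r
# Alcohol and quality copied from the tables earlier in this report:
# 18 wines in the top rating group (8), then 10 wines rated 3.
wine_sample <- data.frame(
  alcohol = c(12.8, 12.6, 12.9, 9.8, 13.4, 11.7, 11, 11, 14, 12.7,
              12.5, 11.8, 13.1, 11.7, 14, 10, 11.3, 11.4,
              9, 8.4, 10.7, 9.9, 11, 10.9, 9.8, 9.7, 10.2, 9.95),
  quality = c(rep(8L, 18), rep(3L, 10))
)

# Draw one box per rating; the return value holds the plotted statistics.
bp <- boxplot(alcohol ~ quality, data = wine_sample,
              xlab = "Rating", ylab = "Alcohol (%)")

# Rows 2 and 4 of bp$stats are the lower and upper edges of each box.
box_edges <- bp$stats[c(2, 4), ]
colnames(box_edges) <- bp$names
rownames(box_edges) <- c("lower_edge", "upper_edge")
print(box_edges)
```

For these rows the box edges come out as 11.3 and 12.9 for the top-rated group and 9.7 and 10.7 for the wines rated 3.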
The data set contains the ratings and chemical properties of 1,599 wines tested by experts and rated on a scale of 0-10 (very bad to excellent), although the data set we have only has ratings from 3 to 8. I started by examining the data set to learn its variables and the data type of each, then moved on to visualizing the ratings and the correlations between the variables.
Some variables show some correlation with the rating given to the wine. However, I don't think the data is strong enough to suggest that a wine is rated higher due to a specific chemical property. I believe it would help to know the circumstances of the judges during the rating process. For example, what kind of food had they eaten on the day of the rating? Did the judges rate multiple wines on the same day? Answers to these questions and more could help us understand the data better and reach better conclusions.
One limitation I faced was not knowing how the ratings were given and within what time period.